*Stata program to sort merge and save LAER excel data in Stata format
*06 Nov 2018

*1. BE SURE THAT A COPY OF THE MOST CURRENT EXCEL DATA FILE HAS BEEN RENAMED: LAEP current version.xlsx
*   and be sure that the "Assistant" variable is numeric, not alphabetic

*2. UPDATE THE FOLLOWING ITEM EACH TIME YOU USE THIS PROGRAM
*Replace the date quotation marks with today's date
global date "06 11 2018"
*This creates a macro containing today's date, so log files and created data files will not overwrite other days' results

*3. CHANGE THE WORKING DIRECTORY to your data location
*Here is Schaefer's directory command. Comment it out with a * and insert your own cd command line
cd "\\Acadia\home\users\schk\My Documents\Liberal Arts colleges and research in economics\Data\"

*4. Some variables in Excel were strings; I recoded them as numeric, but MaxPreTenureLeave isn't working
*Old code:
*destring maximumpossiblepretenureleave, generate(MaxPretenLeave)
*destring externallyfundedgrants, generate(ExtGrants)
*destring Yearsashead, generate(YrsHead)
*For now, I drop a troublesome variable that won't destring at beginning of program with 
* drop maximumpossiblepretenureleave




*The program to create the flat file begins now:

*1. Open the Excel data file, and save its three primary sheets as separate, sorted Stata files.

import excel "LAEP current version.xlsx", sheet("Professor attributes master") firstrow
sort ProfessorIDnumber
gen profmerge=_n
save "professor attributes sorted on profmerge $date.dta", replace


import excel "LAEP current version.xlsx", sheet("Institutional attributes") firstrow clear
sort InstitutionID
save "institution attributes sorted on InstitutionID $date.dta", replace

import excel "LAEP current version.xlsx", sheet("Identification info") firstrow clear
sort ProfessorIDnumber
gen profmerge=_n
save "identification info sorted on profmerge $date.dta", replace

*2. Merge the three files into a single flat file:

merge 1:1 profmerge using "professor attributes sorted on profmerge $date.dta"
drop _merge
sort InstitutionID
save "professor and identification data sorted on InstitutionID $date.dta", replace

merge m:1 InstitutionID using "institution attributes sorted on InstitutionID $date.dta"
drop _merge
drop profmerge

*Add variable labels for clarity:

label variable Gender "Gender, 1=female"
label variable RAappointmentsingradschool "Grad school RA appointment, 1=Yes"
label variable averagequalityofjournal "1-3, 3 is best"
label variable ShareofAarticles "Articles in best / A / rating==3 journals"
label variable Averageimpactfactorofjournal "smaller # is better"
label variable Teachingload "classes per semester"
label variable Percentpay "% pay received on leave/sabbatical"


*save data file
save "merged flatfile sorted on InstitutionID $date.dta", replace



*****************************

*The program to create new variables begins now:

*****************************

*For now, drop a troublesome variable that won't destring
drop maximumpossiblepretenureleave

*make conference support arbitrary $3000 for NHFL level
  replace travelandconferencefundinglev="3000" if travelandconferencefundinglev=="NHFL"
*Convert travelandconferencefundinglev to numeric, generate(ConfSupport)
  destring travelandconferencefundinglev, generate(ConfSupport)

*drop unused and alpha variables
  drop Professor ProfessorIDnumber Institution InstitutionID Assistant Sourceofinfovita U AL AM 
  drop THECOUNTOFPROFSFROMUNRANKED Institutionname Schoolaccreditor M Teachingloaddone 
  drop travelandconferencefundinglev Fullpolicynotfordataentry

*replace -99 missing values with .
ds
local names = r(varlist)
foreach var in `names'{
replace `var'=. if `var'==-99
} 




*1.Generate ATTRIBUTE (RHS match) variables

*Use Gender Full Postdoc as-is
gen Assoc=Associate
gen TsincePhD=YearssincePHD
gen RA=RAappointmentsingradschool
gen Npubs3=Numberofpublicationsinthreey
gen PhDRank=USNWRRankingofPhDprogram
gen PhDQual=PhDRank/Npubs3
*generate a major field dummy, =1 for theory metrics and international==2, 3, 4, 10
gen Thyfield=0
  replace Thyfield=1 if Majorfield==2
  replace Thyfield=1 if Majorfield==3
  replace Thyfield=1 if Majorfield==4
  replace Thyfield=1 if Majorfield==10


*2.Generate TREATMENT (RHS) variables

*Apparently exogenous treatments
gen Sabbat=maximumpossiblesabbaticalswhe
gen Teach=Teachingload
gen NEcon=colleaguesineconDeptnot
gen NField=colleagueswithoverlappingfie

*Potentially endogenous treatments
gen Chair=Departmenthead
gen URank=Rankofschool
gen BusAccred=Econinbusinessschoolmergedde - Businessschoolaccreditationsep
  label variable BusAccred "Econ Dept included in Bus accred"
gen UClassSize=Typicalclasssizecollegewide

*Generate comprehensive family leave variables
gen PdFamLeave=Weekspaidfamilyleave*Percentpay
gen TotFamLeave=Weekspaidfamilyleave*Percentpay + (.3*Weeksadditionalunpaidfamilyle)

*Potentially endogenous treatments with smaller N
gen Admin=Associatedeanmajoradministrato + Majoruniversityservice + (cmteassignments/3)


*3.Generate primary OUTCOME (LHS) variables
gen NArt= Numberofpublications
gen NArtAuth= Numberofpublications/ averagecoauthors
gen NArtQual= Numberofpublications* averagequalityofjournal
gen NArtQualAuth= Numberofpublications* averagequalityofjournal/ averagecoauthors
  label variable NArt "Number of articles published"
  label variable NArtAuth "Number of articles weighted by #coauthors"
  label variable NArtQual "Number of articles weighted by qualtiy"
  label variable NArtQualAuth "Number of articles weighted by quality and #coauthors"

*4. Save new flat file
save "augmented flatfile sorted on InstitutionID $date.dta", replace


*The program to find the treatment effects begins now
*See Stata teffects manual for details

*1. Naive OLS regressions

*OLS with normal distribution of error term
reg NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender Sabbat Teach NEcon NField Chair URank BusAccred UClassSize ConfSupport PdFamLeave
*Add Admin, though it shrinks sample size
reg NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender Sabbat Teach NEcon NField Chair URank BusAccred UClassSize ConfSupport PdFamLeave Admin

*OLS with Poisson distribution, because LHS is a count variable (note major changes to p-values)
poisson NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender Sabbat Teach NEcon NField Chair URank BusAccred UClassSize ConfSupport PdFamLeave
poisson NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender Sabbat Teach NEcon NField Chair URank BusAccred UClassSize ConfSupport PdFamLeave Admin



*2. teffects regressions 

*Generate observation numbers and nearest-neighbor matching observation numbers in variable nn_1
gen obs_id = _n
teffects nnmatch (NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender ) (Full), generate(nn_)
drop nn_2

*Generate difference variables: differences between nearest neighbors in outcome, attribute, and treatment variables
gen NNArt=NArt[nn_1]
gen DNArt= NArt- NNArt

gen NRA   = RA[nn_1]
gen DRA   = RA- NRA

gen NPhDRank   = PhDRank[nn_1]
gen DPhDRank   = PhDRank- NPhDRank

gen NNpubs3   = Npubs3[nn_1]
gen DNpubs3   = Npubs3- NNpubs3

gen NThyfield   = Thyfield[nn_1]
gen DThyfield   = Thyfield- NThyfield

gen NPostdoc   = Postdoc[nn_1]
gen DPostdoc   = Postdoc- NPostdoc

gen NTsincePhD   = TsincePhD[nn_1]
gen DTsincePhD   = TsincePhD- NTsincePhD

gen NGender   = Gender[nn_1]
gen DGender   = Gender- NGender

gen NSabbat   = Sabbat[nn_1]
gen DSabbat   = Sabbat- NSabbat

gen NTeach   = Teach[nn_1]
gen DTeach   = Teach- NTeach

gen NNEcon   = NEcon[nn_1]
gen DNEcon   = NEcon- NNEcon

gen NNField   = NField[nn_1]
gen DNField   = NField- NNField

gen NChair   = Chair[nn_1]
gen DChair   = Chair- NChair

gen NURank   = URank[nn_1]
gen DURank   = URank- NURank

gen NBusAccred   = BusAccred[nn_1]
gen DBusAccred   = BusAccred- NBusAccred

gen NUClassSize   = UClassSize[nn_1]
gen DUClassSize   = UClassSize- NUClassSize

gen NConfSupport   = ConfSupport[nn_1]
gen DConfSupport   = ConfSupport- NConfSupport

gen NPdFamLeave   = PdFamLeave[nn_1]
gen DPdFamLeave   = PdFamLeave- NPdFamLeave

gen NAdmin   = Admin[nn_1]
gen DAdmin   = Admin- NAdmin

save "NNmatched flatfile sorted on InstitutionID $date.dta", replace


*OLS dif-in-dif regression of differences in outcome on differences in treatments
reg DNArt DSabbat DTeach DNEcon DNField DChair DURank DBusAccred DUClassSize DConfSupport DPdFamLeave
reg DNArt DSabbat DTeach DNEcon DNField DChair DURank DBusAccred DUClassSize DConfSupport DPdFamLeave DAdmin

*Multivariate teffects regression using PCA of treatment variables signif in all regressions as a single treatment
pca  NEcon NField Chair URank
predict Pctreatments
xtile PcDeciles = Pctreatments, nq(10)

teffects ra (NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender, poisson) (PcDeciles)
teffects ipw (NArt RA PhDRank Npubs3 Thyfield Postdoc TsincePhD Gender, poisson) (PcDeciles)

*3. etpoisson regressions to account for potential endogeneity 
*Appears to not be possible in Stata for multi-valued treatments
